Extracting Constraints on Word Usage from Large Text Corpora
Abstract
Our research focuses on identifying constraints on word usage from large text corpora. Such constraints are important for natural language systems, both for selecting vocabulary in language generation and for disambiguating lexical meaning in interpretation. The first stage of our research involves developing systems that automatically extract such constraints from corpora, along with empirical methods for analyzing text. The identified constraints will be represented in a lexicon that will be tested computationally as part of a natural language system. We are also identifying lexical constraints for machine translation, using the aligned Hansard corpus as training data, and are identifying many-to-many word alignments.
Similar resources
Retrieving Collocations by Co-occurrences and Word Order Constraints
In this paper, we describe a method for automatically retrieving collocations from large text corpora. This method retrieves collocations in the following stages: (1) extracting strings of characters as units of collocations, and (2) extracting recurrent combinations of strings, in accordance with their word order in a corpus, as collocations. Through the method, a wide range of collocations, especially...
Retrieving Domain-Specific Collocations by Co-occurrences and Word Order Constraints
In this paper, we describe a method for automatically retrieving collocations from large text corpora. This method comprises the following stages: (1) extracting strings of characters as units of collocations, and (2) extracting recurrent combinations of strings as collocations. Through this method, various types of domain-specific collocations can be retrieved simultaneously. This method is pr...
Resolving Translation Ambiguity Using Non-Parallel Bilingual Corpora
This paper presents an unsupervised method for choosing the correct translation of a word in context. It learns disambiguation information from non-parallel bilingual corpora (preferably in the same domain) free from tagging. Our method combines two existing unsupervised disambiguation algorithms: a word sense disambiguation algorithm based on distributional clustering and a translation disambigu...
Word Co-occurrence Counts Prediction for Bilingual Terminology Extraction from Comparable Corpora
Methods for bilingual lexicon extraction from comparable corpora are often based on observing word co-occurrences and are by nature more effective when using large corpora. In most cases, specialized comparable corpora are small, and this has a direct impact on bilingual terminology extraction results. In order to overcome insufficient data coverage and to make ...
Extracting Semantic Representations from Large Text Corpora
Many connectionist language processing models have now reached a level of detail at which more realistic representations of semantics are required. In this paper we discuss the extraction of semantic representations from the word co-occurrence statistics of large text corpora and present a preliminary investigation into the validation and optimisation of such representations. We find that there...